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Abstract. Hierarchical models are increasingly used in many appli- 
cations. Along with this increased use comes a desire to investigate 
whether the model is compatible with the observed data. Bayesian 
methods are well suited to eliminate the many (nuisance) parameters in 
these complicated models; in this paper we investigate Bayesian meth- 
ods for model checking. Since we contemplate model checking as a pre- 
liminary, exploratory analysis, we concentrate on objective Bayesian 
methods in which careful specification of an informative prior distribu- 
tion is avoided. Numerous examples are given and different proposals 
are investigated and critically compared. 

Key words and phrases: Model checking, model criticism, objective 
Bayesian methods, p-values, conflict, empirical-Bayes, posterior pre- 
dictive, partial posterior predictive. 



1. INTRODUCTION 

With the availability of powerful numerical com- 
putations, use of hierarchical (or multilevel, or ran- 
dom effects) models has become very common in ap- 
plications. They nicely generalize and extend 
standard one-level models to complicated 
situations, where these simple models would not ap- 
ply. With their widespread use comes along an in- 
creased need to check the adequacy of such mod- 
els to the observed data. Recent Bayesian methods 
(Bayarri and Berger (1999), 2000) have shown con- 
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siderable promise in checking one-level models, espe- 
cially in nonstandard situations in which parameter- 
free testing statistics are not known. In this paper 
we show how these methods can be extended to 
checking hierarchical models. We also review state- 
of-the-art Bayesian proposals for checking hierarchi- 
cal models and critically compare them. 

We approach model checking as a preliminary anal- 
ysis in that if the data are compatible with the as- 
sumed model, then the full (and difficult) Bayesian 
process of model elaboration and model selection 
(or averaging) can be avoided. The role of Bayesian 
model checking versus model selection has been dis- 
cussed, for example, in Bayarri and Berger (1999, 
2000) and O'Hagan (2003) and we will not repeat it 
here. 

In general, in a parametric model checking sce- 
nario, we relate observables X with parameters 
through a parametric model X | /(x | 9). We 
then observe data "Si-obs and wish to assess whether 
Xo6s are compatible with the assumed (null) model 
/(x I 6). Most of the existing methods for model 
checking (both Bayesian and frequentist) can be seen 
to correspond to particular choices of: 

1. A diagnostic statistic T, to quantify incompati- 
bility of the model with the observed data through 

tabs — T (x.Qf)g^ . 
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2. A completely specified distribution for the statis- 
tic, h{t), under the null model, in which to locate 
the observed tabs- 

3. A way to measure conflict between the observed 
statistic, and the null distribution, h{t), for T. 
The most popular measures are tail areas 
(p-values) and relative height of the density h{t) 

c^t tobs- 

In this paper, we concentrate on the optimal choice 
in item 2, which basically reduces to choice of meth- 
ods to eliminate the nuisance parameters 6 from 
the null model. Our recommendations then apply 
to any choices in 1 and 3. [We abuse notation and 
use the same h{-) to indicate both the completely 
specified distribution for X, after elimination of 6, 
and the corresponding distribution for T.] Of course, 
choice of 1 is very important; as a matter of fact, 
in some scenarios a "good" T can be chosen which 
is ancillary or nearly so, thus making choice of 2 
nearly irrelevant. So our work will be most rele- 
vant for complicated scenarios when such optimal 
T's are not known, or extremely difficult to imple- 
ment (for an example of these, see Robins, van der 
Vaart and Ventura (2000)). In these situations, T 
is often chosen casually, based on intuitive consid- 
erations, and hence we concentrate on these choices 
(with no implications whatsoever that these are our 
recommended choices for T; we simply do not ad- 
dress choice of T in this paper). Also, without loss of 
generality, we can assume that T has been defined 
such that the larger T is, the more incompatible data 
are with the assumed model. As measures of confiict 
in item 3 above, we explore the two best known mea- 
sures of surprise, namely the p- value and the relative 
predictive surprise, RPS (see Berger (1985), Section 
4.7.2) used (with variants) by many authors. These 
two measures are defined as 

(1.1) p=Pr'^(-)(i(X)>t(x„5,)), 

(1.2) iiP5 = M^%ll. 

Note that small values of (1.1) and (1.2) denote in- 
compatibility. 

Frequentist and Bayesian choices for h{-) are dis- 
cussed at length in Bayarri and Berger (2000), and 
we limit ourselves here to an extremely brief (and 
incomplete) mention of some of them. The natu- 
ral Bayesian choice for h{-) is the prior predictive 
distribution, in which the parameters get naturally 



integrated out with respect to the prior distribu- 
tion. (Box (1980) pioneered use of p-values com- 
puted in the prior predictive for Bayesian model crit- 
icism.) However, this requires a fairly informative 
prior distribution (see O'Hagan (2003) for a discus- 
sion) which might not be desirable for model check- 
ing for two reasons: (i) we might wish to avoid the 
careful (and difficult) prior quantification in these 
earlier stages of the analysis, since the model might 
well not be appropriate and hence the effort is wasted; 
(ii) most importantly, model checking with informa- 
tive priors cannot separate inadequacy of the prior 
from inadequacy of the model. 

In the sequel we will concentrate on objective 
Bayesian methods for model checking. We use the 
term objective to refer to Bayesian methods in which 
the priors are chosen by some default, agreed upon 
rules (objective priors) rather than refiecting gen- 
uine (subjective) prior information. This term is fre- 
quent among Bayesians (see, e.g., Berger, 2003, 2006) 
but its use is not without controversy. Objective 
priors are usually improper. Note that this impro- 
priety makes the prior predictive distribution unde- 
fined and hence not available for (objective) model 
checking. 

Guttman's (1967) and Rubin's (1984) choice for 
h{-) is the posterior predictive distribution, result- 
ing from integrating 6 out with respect to the pos- 
terior distribution instead of the prior. This allows 
use of improper priors, and hence of objective model 
checking. This proposal is very easy to implement 
by Markov chain Monte Carlo (MCMC) methods, 
and hence has become fairly popular in Bayesian 
model checking. However, its double use of the data 
can result in an extreme conservatism of the re- 
sulting p-values, unless the checking statistic is 
fairly ancillary (in which case the way to deal with 
the parameters is basically irrelevant). This con- 
servatism is shown to hold asymptotically in 
Robins, van der Vaart and Ventura (2000), and for 
finite n and several scenarios in, for example, Ba- 
yarri and Berger (1999, 2000), Bayarri and Castellanos 
(2001) and Bayarri and Morales (2003). Miscahbra- 
tion of posterior predictive measures is also docu- 
mented in Dahl (2006), Draper and Krnjajic (2006) 
and Hjort, Dahl and Steinbakk (2006); the double 
use of the data was noted in the discussion of 
Gelman, Meng and Stern (2003) (see, in particular. 
Draper (1996)). This is not meant in any way to im- 
ply that posterior predictive measures are without 
merit [see Gelman (2003) for a recent exposition of 



BAYESIAN CHECKING OF THE SECOND LEVELS OF HIERARCHICAL MODELS 



3 



their advantages and interpretation], only that they 
have to be interpreted in a different way: a posterior 
p-value equal to, say, 0.4 can not naively be inter- 
preted as compatibility with the null model in all 
problems. A small posterior predictive measure can 
safely be interpreted as incompatibility with the null 
model. 

Alternative choices of h{-) for objective model 
checking are proposed in Bayarri and Berger (1997, 
1999, 2000). Their asymptotic optimality is shown in 
Robins, van der Vaart and Ventura (2000). In this 
paper we derive these marginals for hierarchical 
model checking. We also compare the results with 
those obtained with posterior predictive distribu- 
tions and several "plug-in" choices for h{-). Note 
that "plug-in" p-values would be natural choices for 
frequentist checking when interpreting the second 
level of a hierarchical model as a "random effect," 
so in particular, we compare some popular choices 
of Bayesian p-values with MLE "plug-in" p-values. 

There are not many proposals for checking the dis- 
tributional assumption of "random effects." Along 
with the mentioned methods, we also carefully re- 
view state-of-the-art Bayesian proposals, 
namely (i) the simulation-based checking of 
Dey, Gelfand, Swartz and Vlachos (1998), a compu- 
tationally intensive method based on Monte Carlo 
tests, (ii) the O'Hagan method (O'Hagan (2003)) 
for checking graphical models, and (iii) the conflict 
p-values of Marshall and Spiegelhalter (2003), close 
in spirit to cross-validation methods. We critically 
compare these methods in several examples. In this 
paper most attention is devoted to the checking of 
a fairly simple normal-normal hierarchical model so 
as to best illustrate the different proposals and criti- 
cally judge their behavior. Of course, the main ideas 
also apply to the checking of many other hierarchi- 
cal models. In Section 2 we briefly review the dif- 
ferent measures of surprise (MS) that we will de- 
rive and compare. In Section 3 we derive these mea- 
sures for the hierarchical normal-normal model. We 
also study the sampling distribution of the different 
p-values, both when the null model is true, and when 
the data come from alternative models. In Section 4 
we apply these measures to a particular simple test 
which allows easy and intuitive comparisons of the 
different proposals. In Section 5 we briefly summa- 
rize other methods for Bayesian checking of hierar- 
chical models, namely those proposed by 
Dey, Gelfand, Swartz and Vlachos (1998), O'Hagan 



(2003) and Marshall and Spiegelhaher (2003), com- 
paring them with the previous proposals in an exam- 
ple. Finally, in Section 6 we check the adequacy of 
a binomial/beta hierarchical model in a well-known 
example using all of the methods reviewed in the 
paper. 

2. MEASURES OF SURPRISE IN THE 
CHECKING OF HIERARCHICAL MODELS 

In this paper we will be dealing with the MS de- 
fined in (1.1) and (1.2). Their relative merits and 
drawbacks are discussed at length in Bayarri and 
Berger (1997, 1999) and will not be repeated here. 
In this section we derive these measures in the con- 
text of hierarchical models, and for some specific 
choices of the completely specified distribution h{-). 
We consider the general two-level model: 

Xij I 9i f{xij I 9i), i = 1, . . . = 1, . . . ,ni, 
/ 

e\r]'''r^-7r{e\rj) = l[7Tie,\r,), 

1=1 

rj ~ 7r(?7), 

where 6 = (9i, . . . ,9j) and rj = (r]i, . . . , rjp) 

To get a completely specified distribution h{-) for 
X, we need to integrate out from /(x | 6) with re- 
spect to some completely specified distribution for 
6. We next consider three possibilities that have 
been proposed in the literature for such a distribu- 
tion: empirical Bayes types (plug- in), posterior dis- 
tribution, and partial posterior distribution, as they 
apply in the hierarchical scenario. Notice that, since 
we will be dealing with improper priors for ry, the 
natural (marginal) prior tt{0) is also improper and 
cannot be used for this purpose [it would produce 
an improper h{-)]. 

2.1 Empirical Bayes (Plug-In) Measures 

This is the simplest proposal, very intuitive and 
frequently used in empirical Bayes analysis (see, e.g., 
Carlin and Louis (2000), Chapter 3). It simply con- 
sists in replacing the unknown 77 in 7r(0 | ry) by an 
estimate (we use the MLE, but moment estimates 
are often used as well). In this proposal, 9 is inte- 
grated out with respect to 

(2.1) 7r^^(0)=7r(0|T7) = ^(0|r7 = T7), 

where rj maximizes the integrated likelihood: 

/(x|77) = j f{^\e)7Tie\r])de. 
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The corresponding proposal for a completely speci- 
fied h{-) in which to define the MS is 

(2.2) m^,lAt) = I f{t\0)7r''^i9)d9. 

The MS Pp^or ^^'^ prior ^'^^ '^o'^ given by (1.1) 
and (1.2), respectively, in which /i(-) = mp^^j.{-). 

Strictly for comparison purposes, we will later use 
another distribution which is also of the empirical 
Bayes type; in this new distribution, the empirical 
Bayes prior (2.1) gets needlessly (and inappropri- 
ately) updated using again the same data. In this 
(wrong) proposal, 6 gets integrated out with respect 
to 

(2.3) 7r^^(0 I Xobs) oc /(xo5s I 0)vr^''(0), 
resulting in 

(2.4) 7n^„^,(t) = Jf{t\ ey'^^ie I x„,,) dO. 

The corresponding MS Pp^gt and RPSp^^f- are again 
computed using (1.1) and (1.2), respectively, with 
h{-)=m%'lt{t). 

2.2 Posterior Predictive Measures 

This proposal is also intuitive and seems to have 
a more Bayesian "flavour" than the plug-in solution 
presented in the previous section. This along with 
its ease of implementation has made the method a 
popular one for objective Bayesian model checking. 
This popularity makes investigation of its properties 
all the more important. The idea is simple: use the 
posterior to integrate 6 out. Assuming that the pos- 
terior is proper (as usual), this allows model check- 
ing when 7r(j7) [and hence vr(0)] is improper. Thus, 
the proposal for h{-) is the posterior predictive dis- 
tribution 

(2.5) rupostit I Xobs) = / -^(^ I ^)^(^ I ^obs) dO, 

where 7r{9 \ Xobs) is the marginal from the joint pos- 
terior 

Tr{6,r]\ Xobs) oc /(x^fe^ | e)Tr{0,r]) 

I 

= f{^obs\0)7r{Ti)l[7T{e,\ri). 

1=1 

The posterior p-value and the posterior RPS are 
denoted by Ppost and RPS post, and computed from 
(1.1) and (1.2), respectively, with h{-) = mpost{-)- 

It is important to remark that, under regularity 
conditions, the empirical Bayes posterior tt^^ {6 \ 



Xobs) given in (2.3) approximates the true posterior 
Tr{0 I Xo6s). Both are, in fact, asymptotically equiva- 
lent. Hence whatever inadequacy of 'nip^g^{t) in (2.4) 
for model checking is likely to apply as well to the 
posterior predictive rUpostit \ ^obs) in (2.5). We will 
see demonstration of the similar behavior of both 
predictive distributions in all the examples in this 
paper. Use of posterior predictive measures was in- 
troduced by Guttman (1967) and Rubin (1984) and 
extended and formalized in Gelman, Meng and Stern 
(2003). They are very easy to compute and they are 
perhaps the most widely used checking procedure. 
We refer to Meng (1994), Gelman, Meng and Stern 
(2003) and Gelman (2003) for extended discussion 
and motivation. 

2.3 Partial Posterior Predictive Measures 

Both the empirical Bayes and posterior proposals 
presented in Sections 2.1 and 2.2 use the same data 
to (i) "train" the improper tt{9) into a proper dis- 
tribution to compute a predictive distribution and 
(ii) compute the observed tabs to be located in this 
same predictive distribution through the MS. This 
can result in a severe conservatism incapable of de- 
tecting clearly inappropriate models. A natural way 
to avoid this double use of the data is to use part 
of the data for "training" and the rest to compute 
the MS, as in cross-validation methods. The pro- 
posal in Bayarri and Berger (1999, 2000) is similar 
in spirit: since tghs is used to compute the surprise 
measures, it uses the information in the data not 
in tobs to "train" the improper prior into a proper 
one. A natural way to "remove" the information in 
tobs = T(X. = Xobs) from Xobs is by conditioning in 
the observed value of the statistic T(X.); that is, us- 
ing the conditional distribution f{^obs I tobs-,^) in- 
stead of f{xobs I d) to define the likelihood. The re- 
sulting posterior distribution for 6 (assumed proper) 
is called a partial posterior distribution and given by 

'Kppp{6 I yiobs \ tobs) « /(Xo5s I tobs,(^)'^{d) 

fi^obs I e)TT{e) 

OC . 

fitobs I ^) 

The corresponding proposal for the completely spec- 
ified h{-) is then the partial posterior predictive dis- 
tribution computed as 

rripppit I Xobs \ tobs) = j f{t\ 0)ir{e \ 

^obs \ tobs)dO. 

The partial posterior predictive measures of surprise 
will be denoted by pppp and RPSppp and, as before. 
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computed using (1.1) and (1.2), respectively, with 



h{-) 



■ m 



ppp 



(•)■ 



Extensive discussions of the advantages and dis- 
advantages of this proposal as compared with the 
previous ones can be found in Bayarri and Berger 
(2000) and Robins, van der Vaart and Ventura 
(2000). In this paper we demonstrate their perfor- 
mance in hierarchical models. 

2.4 Computation of Ph(-) aid RPSh(-) 

Often, for a proposed /i(-), the measures and 
-RP5'/j(.) cannot be computed in closed form. In fact, 
h{-) is often not of closed form itself. In these cases 
we use Monte Carlo (MC), or Markov Chain Monte 
Carlo (MCMC) methods, to (approximately) com- 
pute them. If x^, . . . ,x*^ is a sample of size M gen- 
erated from ^(x), then ti = t(x*) is a sample from 
h{t), and we approximate the MS as: 



1. p- value 

2. relative predictive surprise 



# oi ti> tphs 
M ' 



RPS 



h{.) 



supt h{t) ' 



where h{t) is an estimate (for instance a kernel 
estimate) of the density h at t. When the distri- 
bution of the test statistic T, fxit \ 9), has closed 
form expression, one can avoid kernel estimation 
by using a "Rao-Blackwellized" Monte Carlo es- 
timate of h, that is, h{t) = (1/m) J2T=i hit \ Ok), 
where the 0fc's are draws from the appropriate 
distribution for 6 (proper prior, posterior, par- 
tial posterior, . . .). This is the method used in 
the examples of this paper and was pointed to us 
by a referee. 

3. CHECKING HIERARCHICAL NORMAL 
MODELS 

Consider the usual normal-normal two-level hier- 
archical (or random effects) model with / groups 
and fij observations per group. The I means are as- 
sumed to be exchangeable. For simplicity, we be- 
gin by assuming the variances af at the observation 
level to be known. The model is 



Xij I Oi 



(3.1) 



i = l,...,I,j 



7r{0 \ fi,T) = II N{ei\fi,T^), 

1 



i=l 

„/..\„/_2n 



7r(^,r ) = tt{ij)tt{t ) oc 

7 

In this paper we concentrate on checking the ade- 
quacy of the second-level assumptions on the means 
Oi. Of course, checking the normality of the obser- 
vations is also important, but it will not be consid- 
ered here. The techniques considered in this paper 
as applied to the checking of simple models have 
been explored in Bayarri and Castellanos (2001), 
Castehanos (1999) and Bayarri and Morales (2003). 

Assume that choice of the departure statistic T 
is done in a rather casual manner, and that we are 
especially concerned about the upper tail of the dis- 
tribution of the means. In this situation, a natural 
choice for T is 



(3.2) 



T = max{Xi., . . . , X/.}, 



where Xi. denotes the usual sample mean for group i. 
This T is rather natural, but the analysis would be 
virtually identical with any other choice. Recall that 
if the statistic is fairly ancillary, then the answers 
from all methods are going to be rather similar, no 
matter how we integrate 6 out. 

The density of the statistic (3.2) under the (null) 
model specified in (3.1) can be computed to be 



(3.3) 



f^{t\e) = j^N(t\dkA) 



1=1 ^ ""l 



where N{t \ a, b) and F{t \ a, b) denote the density 
and distribution function, respectively, of a normal 
distribution with mean a and variance b evaluated 
at t. 

We next integrate the unknown 6 from (3.3) using 
the techniques outlined in Section 2. 

3.1 Empirical Bayes Distributions 

It is easy to see that the likelihood for /x and 
is simply 



(3.4) /(x|/.,r2) = niV 



i=l 



Xi 
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from which (1 and can be computed. Then (2.1) 
is given by 



T,^^{e) = T,{e\(i,f^) = X{N{e, 



fJ,,T 



i=l 



which we use to integrate 6 out from (3.3). The re- 
sulting ?Tip^Q^(x) does not have a closed form ex- 
pression, but simulations can be obtained by sim- 
ple MC methods. For comparison purposes, we will 
also consider integrating 9 w.r.t. the (inappropri- 
ate) empirical Bayes posterior distribution. The re- 
sulting mp^((x) is also trivial to simulate from by 
using a similar MC scheme. Details are given in Ap- 
pendix A. 

3.2 Posterior Predictive Distribution 

This proposal integrates 6 out from (3.3) w.r.t. its 
posterior distribution. For the noninformative prior 
tt{i_l,t'^) oc 1/t, the joint posterior is 

7rpost(0,/U,T^|Xo5s) 

(3.5) oc /(x I e, fi, T^)n{e I /i, r2)7r(/i, r^) 

1=1 ^ ' ^ 1=1 

To simulate from the resulting posterior predic- 
tive distribution niposti^ I ^obs), we first simulate 
from TTposj (0, /X, r^|xo5s) and for each simulated 0, 
we simulate x from f{x\0). To simulate from the 
joint posterior (3.5) we use an easy Gibbs sampler 
defined by full conditionals given in Appendix B. 

3.3 Partial Posterior Distribution 

To simulate from the partial posterior predictive 
distribution, ruppp, we proceed similarly to 
Section 3.2, except that simulations for the param- 
eters are generated from the partial posterior distri- 
bution: 



ppp y 1 T I ^ofes \ ^o6s) oc 

where iTpostiO, fi,T'^ \ Xobs) 
are given in Appendix C. 



f{tobs I ^) 

is given in (3.5). Details 



3.4 Examples 

For illustration, we now compute the MS, that is, 
the p-values and the relative predictive surprise in- 
dexes for the different proposals. We use a couple of 
data sets with five groups and eight observations in 
each group. In both of them the null model is not the 



model generating the data; in Example 1 one of the 
means comes from a different normal with a larger 
mean, whereas in Example 2 the means come from 
a Gamma distribution. Recall that the null model 
(3.1) had the group means i.i.d. normal. 

Example 1. The group means are 1.56, 0.64, 
1.98, 0.01, 6.96, simulated from 

Xijr^NiOiA), i = l,...,5,j = l,...,8, 

0i^N{l,l), i = l,...,4, 

^5~iV(5,l). 

Example 2. The group means are: 0.75, 0.77, 
5.77, 1.86, 0.75, simulated from 



Oi ~ ^0(0.6, 0.2), 



l,...,5,j 
1,...,5. 



1 



In Table 1 we show all MS for the two examples. 
The partial posterior measures clearly detect the in- 
adequate models, with very small p- values and RPS. 
On the other hand, none of the other predictive dis- 
tributions work well for this purpose, no matter how 
we choose to locate the observed to6s in them (with 
p- values or RPS). The prior empirical Bayes are 
conservative, with p and RPS an order of magni- 
tude larger than the ones produced by the partial 
posterior predictive distribution. Both the posterior 
empirical Bayes and predictive posterior measures 
are extremely conservative, indicating almost per- 
fect agreement of the observed data with the quite 
obviously wrong null models. Besides, it can be seen 
that EB posterior and posterior predictive measures 
are very similar to each other. This is not a specific 
feature of these examples, but occurs very often. We 
further explore it in a rather simple null model in 
Section 4. 

We next study the behavior of the different p- values, 
when considered as a function of X, under the null 
and under some alternatives. 

3.5 Null Sampling Distribution of the p-Values 

In Section 2, we have reviewed four different ways 
to define (Bayesian) p-values for model checking. To 
compare their performance, we should address the 
question of what do we want in a value. 

For frequentists, one appealing property of p- values 
is that, when considered as random variables, p(X) 
have ?7(0,1) distributions under the null models. 
This endorses p-values with a very desirable prop- 
erty, namely having the same interpretation across 
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Table 1 

p-values and RPS for Examples 1 and 2 





f prior 


T% -paEB 


Ppost 


X? tdqEB 

'J post 


Ppost 


RPS post 


Pppp 


RPS ppp 


Ex. 1 


0.13 


0.28 


0.35 


0.93 


0.41 


0.97 


0.01 


0.01 


Ex. 2 


0.12 


0.29 


0.30 


0.88 


0.38 


0.95 


0.01 


0.01 



problems. Statistical measures that lack a common 
interpretation across problems are simply not very 
useful. (For more extensive discussion of this point, 
see Robins, van der Vaart and Ventura (2000).) In 
fact, the uniformity of p-values has often been taken 
as their "defining" characteristic (Meng (1994); 
Rubin (1996); De la Horra and Rodriguez-Bernal 
(1997); Thompson (1997); Robins (1999); 
Robins, van der Vaart and Ventura (2000)). For 
most problems, exact uniformity under the null for 
all 6 cannot be attained for any p-value. Thus one 
must weaken the requirement to some extent. A nat- 
ural weaker requirement is that a p- value be C/(0, 1) 
under the null in an asymptotic sense (see 
Robins, van der Vaart and Ventura (2000)). As an 
aside, it should be remarked that uniformity of 
p-values is an essential assumption for some anal- 
yses based on p-values, as some popular algorithms 
for handling multiplicities (see Cabras (2004)). 

It is not obvious that Bayesians should be con- 
cerned with establishing that a p-value is uniform 
under the null for all 9. For instance, when the prior 
is proper, the prior predictive p- value is U{0, 1) un- 
der m(x), which means it is U{0,1) in an average 
sense over 6. If the prior distribution is chosen sub- 
jectively, a Bayesian could well argue that this is 
sufficient. Indeed Meng (1994) suggested that uni- 
formity under m(x) is a useful criterion for the eval- 
uation of any proposed (Bayesian) p-value. 

If the prior is improper, however (as it is often 
the case in objective Bayes model checking, the sub- 
ject of this paper), then this prior predictive uni- 
formity criterion cannot be used. Of course, if a 
p-value is uniform under the null in the frequen- 
tist sense, then it has the strong Bayesian property 
of being marginally U{0, 1) under any proper prior 
distribution. This explains why Bayesians should, 
at least, be highly satisfied if the frequentist re- 
quirement obtains. Perhaps more to the point, if 
a proposed p-value is always either conservative or 
anticonservative in a frequentist sense (see 



Robins, van der Vaart and Ventura (2000), for defi- 
nitions), then it is likewise guaranteed to be conser- 
vative or anti-conservative in a Bayesian sense, no 
matter what the prior. Interesting related discus- 
sion concerning the posterior predictive p-value can 
be found in Meng (1994), Gelman, Meng and Stern 
(2003), Rubin (1996), Gelman (2003), Dahl (2006) 
and Hjort, Dahl and Steinbakk (2006). There is a 
vast literature on other methods of evaluating 
p-values. Further discussion and references can be 
found in Bayarri and Berger (2000). 

Here, we focus on studying the degree to which 
the various p-values deviate from uniformity in finite 
sample scenarios. For this purpose, we simulate the 
null sampling distribution of p^^or 0^) > Ppost (X) and 
Pppp (X) , when X comes from a hierarchical normal- 
normal model as defined in (3.1). [We do not show 
the behavior of Pp^j(X) because it is basically iden- 
tical to that of Ppost (X).] 

In particular, we have simulated 1000 samples from 
the following model: 

X,,~iV(0„4), i = l,...,/,j = l,...,8, 

9ir^N{0,l), i=l,...,I. 

We have considered three different "group sizes": 
I = 5, 15 and 25. (Since here we are checking the dis- 
tribution of the means, the adequate "asymptotics" 
is in the number of groups.) 

We compute the different p-values for 1000 sim- 
ulated samples. Figure 1 shows the resulting his- 
tograms. As we can see, ppj,p(X) has already a 
(nearly) uniform distribution even for / (number of 
groups) as small as 5. On the other hand, the dis- 
tributions of both PpriorOQ ^^'^ PpostOQ are quite 
far from uniformity, the distribution of ppost(X) be- 
ing the farthest. Moreover, the deviation from the 
U{0, 1) is in the direction of more conservatism (given 
little probability to small p-values, and concentrat- 
ing around 0.5), as it is the case in simpler models. 
Notice that conservatism usually results in lack of 
power (and thus in not being able to detect data 
coming from wrong models). Particularly worrisome 
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is the behavior of Ppost (X) for small number of groups 
We have also performed similar simulations for larger 
I's (number of groups) to investigate whether the 
distribution of these p- values approaches uniformity 
as / grows. In Figure 2 we show the histograms 
for / = 100 and I = 200 of values Ppost (X) and 
Pprior (-^) ['^^ '^ot show pppp (X) as it is virtually 
uniform] . The distributions of these p- values do not 
seem to change much as / is doubled from / = 100 
to I = 200, and they are still quite far from unifor- 
mity, still showing a tendency to concentrate around 
middle values for p. We do not know whether these 
p- values are asymptotically U{0,1). 

3.6 Distribution of p- Values Under Some 
Alternatives 

In this section we study the behavior of 
p^^fo^(X), Ppost (X) and Pppp(X), when the "null" 
normal-normal model is wrong. In particular, we fo- 
cus on violations of normality at the second level. 

Specifically, we simulate data sets from three dif- 
ferent models. In all the three, we take the distribu- 
tion at the first level to be the same and in agree- 
ment with the first level in the null model (3.1): 

~ N{9i, a^ = 4), i = 1, . . . , /,i = 1, . . . , 8. 

We use three different distributions for the group 
means (remember, under the null model, the 0j's 
were normal): 

1. Exponential distribution: 9i ~ Exp(l),i = 1, 

2. Gumbel distribution: 6i ~ Gumbel(0, 2), i = 1, 
...,/, where the Gumbel(a, (3) density is 

/(x|a,/3) = iexp(-^) 

.exp(-exp(-;^)), 

for — oo < X < oo. Gumbel distribution is also 
known as Extreme Value Type I distribution. It 
is skew, with a long tail to the right (left) when 
derived as the limiting distribution of a maximum 
(minimum). 

3. Log-Normal distribution: ~ LogNormal(0, 1), 
i = l,...J. 

We have considered 1 = 5 and / = 10, simulated 
1000 samples from each model and computed the 
different p-values for each sample. In Table 2 we 
show Pr{p < a) for the three p-values and some val- 
ues of a. Pppp seems to show decent power given the 



Table 2 

Pr{p < a) for Pppp, Ppost and Pprion for different values of I 
and the three alternative models 



a 0.02 0.05 0.1 0.2 0.02 0.05 0.1 0.2 

Normal- Exponential 







1 = 


5 






J = 


10 




Pppp 


0.04 


0.08 


0.15 


0.24 


0.12 


0.20 


0.29 


0.42 


Ppost 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.01 


0.05 


y prior 


0.00 


0.00 


0.00 


0.23 


0.00 


0.06 


0.18 


0.37 



Normal- Gumbel 







1 = 


5 






J = 


10 




Pppp 


0.12 


0.22 


0.32 


0.46 


0.21 


0.31 


0.42 


0.55 


Ppost 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


yprior 


0.00 


0.00 


0.00 


0.23 


0.00 


0.07 


0.19 


0.38 



Normal-Lognormal 







1 = 


5 






/ = 


10 




Pppp 


0.16 


0.22 


0.31 


0.41 


0.32 


0.42 


0.50 


0.61 


Ppost 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.02 


y prior 


0.00 


0.00 


0.00 


0.23 


0.01 


0.06 


0.13 


0.23 



small sample sizes and number of groups (power is 
lower for the exponential alternative, and largest for 
the log- normal); both Pp^ior ^'^d Ppost show consider- 
able lack of power in comparison. In particular, no- 
tice the extreme low power of Ppost in ah instances, 
producing basically no p-values smaller than 0.2. 

4. TESTING ^l = ^lo 

As we have seen in Section 3, the specified predic- 
tive distributions for T (empirical Bayes, posterior 
and partial posterior) used to locate the observed 
tabs had to be dealt with by MC and MCMC meth- 
ods. To gain understanding in the behavior of the 
different proposals to "get rid" of the unknown pa- 
rameters, we address here a simpler "null model" 
which results in simpler expressions and allows for 
easier comparisons. 

Suppose that we have the normal-normal hierar- 
chical model (3.1) (with af known) but that we are 
interested in testing 

A natural T to consider to investigate this Hq is the 
grand mean: 

where Xi., i = 1, . . . ,1 , are the sample means for the 
/ groups. The (null) sampling distribution of T is: 

T\e^N{fiT,VT) 
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I 1 T 1 1 1 

0,0 0,2 OA 0.6 0,8 1,0 




1 1 1 1 

0.0 0.2 0.4 0.6 0.8 1.0 



1 J 


Itr 







0.0 0.2 0,4 0,6 0,e 1.0 
b1S 




1 1 1 r 

0.0 0.2 0.4 0.6 O.S t.O 

1=25 




0,0 0.2 0,4 0,6 O.e 1,0 
Pew*'') 




0.0 0.2 0.4 0.6 
Ppw(X) 




I 1 1 [ 1 1 

0.0 0.2 0,4 0.6 OS 1.0 




1 1 1 1 

0.0 0.2 0.4 0.6 0,e 1.0 



Pposl(>t) 



v> 
c 

IB 

a 




0,0 0,2 0.4 0,6 



Fig. 1. Null distribution of Pp^^^CX.) (first column), ppost(X) (second column) and pppp{X.) (third column) for I = 5 (first 
row), 15 (second row) and 25 (third row). 

(4.1) 



witn = -^^7 5 ■ 



Hi{^obs))/Kl^o) 



Again we will integrate 9 out from (4.1) with the 



(4.3) RPS 

supth{t)/h{ido) 
4.1 Empirical Bayes Distributions 

In this case the integrated likelihood for is sim- 
previous proposals and compare the resulting pre- ply given by (3.4) with // replaced by /xq, from which 
dictive distributions for T, h{t), and the correspond- the MLE of can be computed. For this nuh 
ing MS (which we take relative to ^o): model, it is possible to derive closed form expres- 

sions for the prior and posterior empirical Bayes dis- 
(4.2) p = Pr^(-)(|t(X) -^ol > |i(x obs) ~/^o|)) tributions given in (2.2) and (2.4), respectively. 
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1=100 



1=100 



V) 

c 
O 



in 
o 




I T" 

0,0 0.2 



— I T" 

0,4 0.6 



— I 1 

0,8 1,0 



O 




I r- 

0.0 0,2 



— I \ 1— 

0.4 0,6 o.e 



1,0 



1=200 



1=200 



ID 



o 



p 




I 1 ; r 



T 



0-2 4 0-6 0-8 1.0 



c 

a 



in 

o 



d 




I [ 1 \ 1 1 

0-0 2 0-4 0,6 0-8 1 



Ppost(^) 



Fig. 2. Null distribution of p™„^{}C) and Ppost{X.) when I — 100 (first row) and 1 = 200 (second row). 



Indeed, the joint empirical Bayes prior predictive (4.4) 
for X = is ^^EB 



EB 



prior 



{5t) = Y[N(x,. 



i=l 



prior NO 



i=l 



The empirical Bayes posterior predictive distribu- 
tion m^j(x) can be derived in a similar manner 
so that the corresponding distribution for T, resulting also in a normal with mean and 

PR / \ • 1-1 1 ■ • posi \ / 

m^^^^(t), IS normal with mean and variance given variance 

by 



E. 



EB 
prior 



EB _ ^i=l ^jEj 
post ~ sr^I 

2^1=1 ^i 
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(4.5) 



where 



and 



EB 



post 



m/af + 1/f 2 



Vi 



The MS (4.2) and (4.3) can also be computed in 
closed form. The (prior) empirical Bayes measures 
are 



if 'prior 



2-1-$ 



\tohs — /^o| 



' prior 



RPS 



EE 
prior 



■ exp 



jtobs — fJ-oY 
prior 



where $ denotes the standard normal distribution 
function. The posterior empirical Bayes measures 
can similarly be derived in closed form, but they are 
of much less interest and we do not produce them 
here (see Castellanos (2002)). 

The inadequacies of nipost fo'^ testing the null model 
can already be seen in the above formulas, but they 
are more evident in the particular homoscedastic, 
balanced case: af = cr^ and rij = n Vi, i = 1, . . . , /. In 
this case the distribution of T simplifies to 



Tr^N 



I 



In 



Also, there is a closed form expression for the MLE 
of T^: 

.2 T.l=l{Xi.-m? 

r = max< 0, ; 

y I n 

Then, the mean and variance of rrip^^g^, as given 
in (4.4), are 



prior 



Mo, K 



EE 
prior 



£l + f2 

•n ' 



Similarly, the mean and variance of m^^ 



in (4.5), reduce to 



post, given 



E. 



EE 
post 



V, 



EE 
post 



n/a^ + l/f2 
n/(nf2 + (t2) 



For a given (iq (and fixed r), it is now easy to inves- 
tigate the behavior of rrip^^gj. and m^^j as t^bs — > oo, 
indicating flagrant incompatibility between the data 
and Hq. The comparison in this simple case is en- 
lightening. First, note that rrip^g^. centers at /Uq, 
which in principle allows for declaring incompatible 
a very large value tabs] however, the variance also 
grows to oo as tabs grows, thus alleviating the incom- 
patibility, and maybe "missing" some surprisingly 
large tabs- Thus, the behavior of m^^-^^. is reason- 
able, but might be conservative. On the other hand, 
the behavior of m^^g^ is completely inadequate. In- 
deed, for very large tabs-, it centers precisely at tobs-, 
thus precluding detecting as unusual any value tobs-, 
no matter how large! Moreover, the variance goes to 
(2(T2)/(n/), a finite constant. It is immediate to see 
that nipost should not be used to check this partic- 
ular (and admittedly simple) model; as a matter of 
fact, for tobs — > oo (extremely inadequate models) we 
expect p-values of around 0.5. We remark that the 
previous argument does not belong to any particu- 
lar MS; rather it reflects the inadequacy of rripost foi' 
model checking, whatever MS we use. Note that we 
expect similar inadequacies to occur with the pos- 
terior predictive distribution, which is rather often 
used in objective Bayes model checking. 

4.2 Posterior Distribution 

No major simplifications occur for this specific 
Hq. The posterior distribution is not of closed form 
(not even for the homoscedastic, balanced case), and 
hence neither is the posterior predictive distribution. 
We can, however, easily generate from it with vir- 
tually the same Gibbs sampler used in Section 3.2: 
it suffices to (obviously) ignore the full conditional 
for /X and replace /x with the value /xq in the other 
two full conditionals (B.2) and (B.3), which were 
standard distributions. 

4.3 Partial Posterior Distribution 

There is no closed form expression for the partial 
posterior distribution either, but considerable sim- 
plification occurs since the Metropolis-within-Gibbs 
step is no longer needed and a straight Gibbs sam- 
pler suffices. The full conditional for is as given 
in (C.2) with /i replaced by ^q! the full conditional 
of each 9i is here also normal: 



vr 



,6>_ 



obs \ tobs) = N{9i 



Ef,VP) 
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where 




(4.6) 



V j i^i ) ) 



1 



(47) — = ^ 



1 



1 

+ — • 



Details of the derivations appear in Appendix D. 

4.4 Some Examples 

We next consider four examples in which we carry 
out the testing Hq : fi = 0. We consider / = 8 groups, 
with n = 12 observations per group, and cr^ = 4. In 
one of the examples (Example 1) Hq is true and the 
means 6i are generated from a A^(0, 1). In the re- 
maining three examples the null Hq is wrong, with 
9i ~ iV(1.5, 1) in Example 2, Oi ~ 7V(2.5, 1) in Exam- 
ple 3, and 9i ~ A^(2.5, 3) in Example 4. The simu- 
lated sample means are: 

Example 3. 

X = (-2.18, -1.47, -0.87, -0.38, 

0.05,0.29,0.96,2.74). 

Example 4. 

X = (-0.05, 0.66, 1.37, 1.70, 1.72, 2.14, 2.73, 3.68). 
Example 5. 

X = (1.53, 1.65, 1.71, 1.75, 1.87, 2.16, 2.47, 3.68). 
Example 6. 

x = (0.50, 1.52, 1.59, 2.73, 2.88, 3.54, 4.21, 5.86). 

In Figure 3 we show the predictive distributions 
for all proposals in the four examples. A quite re- 
markable feature is that in every occasion, 
basically coincides with mpost, so much that they 
can hardly be told apart. We were expecting them 
to be close, but not so close. Also, when the null is 
true (Example 1), all distributions rightly concen- 
trate around the null and, as expected, the most 
concentrated is m^^^ (and mpost), and the least is 
ruppp {rrip^^gj. ignores the uncertainty in the estima- 
tion of r^). When the null model is wrong, however, 
even though both rUppp and m^^^^^. have the right lo- 



cation, nippp is more concentrated than rrip^^j., thus 
indicating more promise in detecting extreme tabs- 
Notice the hopeless (and identical) behavior of rrip^^^ 
and mpost- both concentrate around tots, no matter 
how extreme; that is, there is no hope that it can 
detect incompatibility of a very large tobs with the 
hypothetical value of 0. 

In Table 3 we show the different MS for the four 
examples. All behave well when the null is true, but 
only the ppp and the prior empirical Bayes measures 
detect the wrong models (ppp more clearly) . On the 
other hand, rra^^^ and mpost produce very similar 
measures and both are incapable of detecting clearly 
inappropriate null models. Notice that the conser- 
vatism of the posterior predictive measures (and the 
posterior empirical Bayes ones) is extreme. 

5. A COMPARISON WITH OTHER BAYESIAN 
METHODS 

In this section we retake the main goal of checking 
the adequacy of the second level in the hierarchical 
model: 



Xi,\ei^N{e,,a''), 



,n,- 



7Tie\l^,T) = YlNi9,\f,,T^), 

1=1 

with cr^ unknown, as well as /z,r^. We first pro- 
vide some details needed to derive the MS used 
so far when o"^ is unknown; we then briefly review 
three recent methods for Bayesian checking 
of hierarchical models, proposed in 
Dey, Gelfand, Swartz and Vlachos (1998), O'Hagan 
(2003) and Marshah and Spiegelhalter (2003). We 
do not specifically address here (because the philos- 
ophy is somewhat different) the much earlier, likeli- 
hood/empirical Bayes proposal of Lange and Ryan 
(1989), which basically consists in checking the nor- 
mality of some standardized version of estimated 

Table 3 

p-values and RPS for testing ^ = m the four examples 





Example 1 


Example 2 


Example 3 


Example 4 




p RPS 


p RPS 


p RPS 


p RPS 


PPP 


0.86 0.98 


0.01 0.01 


0.00 0.00 


0.00 0.01 


EB prior 


0.83 0.98 


0.02 0.06 


0.01 0.03 


0.01 0.05 


EB post 


0.71 1.00 


0.31 0.89 


0.30 0.88 


0.38 1.00 


post 


0.71 1.00 


0.33 0.92 


0.32 0.95 


0.39 1.00 



BAYESIAN CHECKING OF THE SECOND LEVELS OF HIERARCHICAL MODELS 



13 



Example 1 



Example 2 



c 
O 





Example 3 



Example 4 



in 
c 

O 













ci " 




— mppp . 






in _ 




Pf*W 


■ ■ nfipwi 

■I- 




q _ 




















d ~ 




y / 


1 \ 








^ / 


\ ^ 




o 




J* / 
^ y 


\ ^ 




d ~ 












1 

-3 


1 1 1 

-£ -1 


1 1 

1 2 


1 

3 




Fig. 3. Different predictive distribution for T in each example. The vertical solid line locates tabs- The curves corresponding 
to rUpost and m^g^i were almost indistinguishable and for clarity are represented as identical. 



residuals. We apply the four methods considered so 
far and the three new methods to a data set pro- 
posed in O'Hagan (2003). 

O'Hagan (2003) EXAMPLE. In the general sce- 
nario of checking the normal-normal hierarchical 
model, O'Hagan (2003) uses the following data set: 



Group 1 


2.73, 


0.56, 


0.87, 


0.90, 


2.27, 


0.82. 


XI. = 


1.36. 


Group 2 


1.60, 


2.17, 


1.78, 


1.84, 


1.83, 


0.80. 


X2.^ 


1.67. 


Group 3 


1.62, 


0.19, 


4.10, 


0.65, 


1.98, 


0.86. 


X-i.^ 


1.57. 


Group 4 


0.96, 


1.92, 


0.96, 


1.83, 


0.94, 


1.42. 


X4,.— 


1.34. 


Group 5 


6.32, 


3.66, 


4.51, 


3.29, 


5.61, 


3.27. 


X5.= 


4.44. 



Note that X5. is considerably far from the other 
four sample means. 

5.1 Methods Used So Far 

The empirical Bayes methods (both the prior and 
the posterior) have an easy generalization to the un- 
known (T^ case. It suffices to substitute a"^ by its 
usual MLE estimate and apply the procedures in 
Section 3 for cr^ known. 

For both the posterior predictive and the partial 
posterior predictive measures, we need to specify a 
new (noninformative) joint prior. Since we can use 
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the standard noninformative prior for cj^, we take 



(5.1) 



cr^ r 



To simulate from the posterior distribution, we again 
use Gibbs samphng. The full conditionals for 9, ij, 
and are the same as for the known cr^ and they 
are given in (B.3), (B.l) and (B.2), respectively. The 
full conditional for the new parameter, cj^, is 

2 \ n 2 -2/ ~2\ 

where m = Ef=i and a'^ = J2i=iJ2]Lii^ij - 
e^f|n. 

The (joint) partial posterior distribution is 

7r(0,cr2,//,r2|xo6s) 



f{tobs\0,a^) 



and again we use the same general procedure as for 
the cj^ known scenario (see Section 3). We only need 
to specify how to simulate from the full conditional 
of a^: 



T^pppicr'^ I 0,iJ,,T'^,:x.obs \ tabs) oc 



f{tobs\e,a^y 

We use Metropolis-Hastings with x~^(TO', 5^) as pro- 
posal distribution. The acceptance probability (at 
stage k) of candidate cr^*, given the simulated val- 
ues 



a = min< 1 



f{Ubs\e'^'^\a'^^)] 
/(t„,,|0W,a2*) 

We next derive the different MS for O'Hagan data. 

O'HAGAN (2003) EXAMPLE (CONTINUED). The 

empirical Bayes, posterior predictive and partial pos- 
terior predictive MS applied to this data set, using 
T = maxjjXj}, are shown in Table 4. 

We again observe the same behavior as the one 
repeatedly observed in previous examples: in spite 
of such an "obvious" data set, only the partial pos- 
terior measures detect the incompatibility between 
data and model. The empirical Bayes prior measures 
are too conservative, and the posterior predictive 
measures (and their very much alike empirical Bayes 
posterior ones) are completely hopeless. 

5.2 Simulation-Based IVIodel Checking 

This method is proposed in 
Dey, Gelfand, Swartz and Vlachos (1998), as a com- 
putationally intense method for model checking. 



This method works not only with checking statis- 
tics T, but more generally, with discrepancy mea- 
sures d, that is, with functions of the parameters 
and the data. This feature also applies to the pos- 
terior predictive checks that we have been consid- 
ering all along. In essence, the method consists in 
comparing the posterior distribution d \ x^bs with R 
posterior distributions of d given R data sets x^, for 
r = 1, . . . ,R, generated from the (null) prior predic- 
tive model. Note that this method requires proper 
priors. Comparison is carried out via Monte Carlo 
Tests (Besag and Clifford (1989)). 

Letting x'', for r = 0, denote the observed data 
Xo6s, their algorithm is as follows: 

• For each posterior distribution of d given x^',r = 
0, . . . ,R, compute the vector of quantiles q^^^ = 

iQo%^Qo%,Qo.hQo.75^Qo%)^ where q^a^ is the a- 
quantile of the posterior distribution given data 
x\ r = 0,...,R. 

• Compute the vector q of averages, over r, of these 

quantiles: q = (go.05> 9o.25>^o.5> 9o.75> 9o.95)- 

• Compute the r + 1 Euclidean distances between 



0,1, . . . , R and q. 



• Perform a 0.05 one-sided, upper tail Monte Carlo 
test, that is, check whether or not the distance 
corresponding to the original data is smaller than 
the 95th percentile of the r + 1 distances. 

In reality, this method is not a competitor of the 
ones we have been considering previously, since it 
requires proper priors, and hence is not available for 
objective model checking. We, however, apply it also 
to O'Hagan data. 

O'Hagan (2003) example (continued). In 
order to perform the simulation-based model check- 
ing, we need proper priors. We use the ones proposed 
in O'Hagan (2003): 

/i~A^(2,10), o-^~22T^, T^r^6W 
(5.2) _ 

where W ~ X20 ■ 

Along with the statistic used so far, we have also 
considered a measure of discrepancy which in this 
case is just a function of the parameters: 

Ti=maxXi., T2 = max \9i — 

With 1000 simulated data sets from the null, the 
results are shown in Table 5. It can be seen that, 
with the given prior, incompatibility is detected with 
T2, but not with Ti. We do not know whether T2 
would detect incompatibility with other priors (see 
related results in Section 5.3). 
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Table 4 

MS (a^ unknown) for O^Hagan data set 



f prior 




Ppost 


T> TtQEB 

'J post 


Ppost 


RPS post 


Pppp 


RPS ppp 


0.19 


0.4 


0.37 


0.95 


0.40 


0.99 


0.01 


0.02 



5.3 O'Hagan Method 

O'Hagan (2003) proposes a general method to in- 
vestigate adequacy of graphical models at each node. 
We will not describe his method in full generality, 
but only how it applies to checking the second level 
of our normal-normal hierarchical model. 

To investigate conflict between the data and the 
normal assumption for each of the group means, this 
proposal investigates conflict between the likelihood 



Table 6 

Posterior medians of Ci, i — 1, ... ,5, for O'Hagan data set 



for e,,U]Llf{x^ 



9i,a ), and the (null) density for 



TT 



To check conflict between two known univariate 
densities/likelihoods, O'Hagan proposes a "measure 
of conflict" based on their relative heights 
at an "intermediate" value. Specifically, the likeli- 
hoods/densities are first normalized so that their 
maximum height is 1 (notice that this is equiva- 
lent to dividing by their respective maximum, as 
in RPS before). Then the (common) density height, 
z, at the value of 9i between the two modes where 
the two densities are equal, is computed. The pro- 
posed measure of conflict is c = —2 In z. For the par- 
ticular case of comparing two normal distributions, 
iV(wj,7?), for i = 1,2, this measure is 

(5.3) c=( "^-"^ 

O'Hagan indicates that a conflict measure smaller 
than 1 should be taken as indicative of no conflict, 
whereas values of 4 or larger would indicate clear 
conflict. No indication is given for values lying be- 
tween 1 and 4. 

When, as usual, the distributions involved depend 
on unknown parameters, the measures of conflict [in 

Table 5 
Euclidean distance between q'"^ 
and q and the 0. 95 quantile of all 
distances 





llq^^'-qll 


0.95 quantile 




2.31 


13.46 


T2 


1.82 


0.81 





6*1 


6*2 


6*3 


6*4 


6I5 


O'Hagan priors 
Noninformative priors 


0.43 
0.16 


0.14 
0.09 


0.22 
0.11 


0.46 
0.16 


4.81 
1.36 



particular (5.3)], cannot be computed. O'Hagan's 
proposal is then to use the median of their posterior 
distribution. Notice that this is closely related to 
computing a relative height on the posterior predic- 
tive distribution and, hence, the concern exists that 
it can be too conservative for useful model check- 
ing. In fact this conservatism was highlighted in the 
discussions by Bayarri (2003) and Gelfand (2003). 

Interestingly enough, O'Hagan defends use of pro- 
per priors for the unknown parameters, so neither 
posterior predictive nor posterior distributions are 
needed for implementation of his proposal (since the 
prior predictives and priors are proper). Alterna- 
tively, if one wishes to insist on using posterior dis- 
tributions (instead of the, more natural, prior distri- 
butions), then proper priors are no longer needed, 
and the method can thus be generalized. Accord- 
ingly, we also apply his proposal with the noninfor- 
mative prior (5.1). 

O'Hagan (2003) example (continued). We 
compute the measure (5.3) for the data set proposed 
by O'Hagan (2003). To derive the posterior distri- 
butions, we use both the proper priors proposed by 
O'Hagan for this example, given in (5.2), and the 
noninformative prior (5.1). The posterior medians 
for c are shown in Table 6. It can be seen that the 
results are very dependent on the prior used: the 
spurious group 5 is detected with the specific proper 
prior used, but not with the noninformative priors 
(thus suffering from the expected conservatism) . We 
recall that data were clearly indicating an anoma- 
lous group 5. 

5.4 "Conflict" p-Value 

Marshall and Spiegelhalter (2003) proposed this 
approach based on, and generalizing, cross-validation 
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methods (see Gelfand, Dey and Chang (1992); 
Bernardo and Smith (1994), Chapter 6). 

In cross- vahdation, to check adequacy of group i, 
data in group i, Xj, are used to compute the "sur- 
prise" statistic (or diagnostic measure) , whereas the 
rest of the data, X_i, are used to train the improper 
prior. A mixed p- value is accordingly computed as 

(5.4) pi^mi. = Pr™— (•l^-)(I-, > T/^^), 

where the completely specified distribution used to 
compute the ith p- value is 

'^crossi^i I i) 

= J f{ti I ei,a^)7T{e^ I IX,T^)TT{ti,T^a^ I X_i)d0, 

and thus there is no double use of the data. 

Marshall and Spiegelhalter (2003) aim to preserve 
the cross-validation spirit while avoiding choice of 
a particular statistic or discrepancy measure Tj = 
T(Xi). Specifically, they propose use of conflict 
p- values for each group i, computed as follows: 

• Simulate 9^^^ from the posterior 6i \ X_j. 

• Simulate Of^ from the posterior 0j | Xj. 
. Compute 9f^ = 9rP-et. 

• Compute the "conflict" p- value for group i,i = 

as 

(5.5) P^,con = Pr{9fff <0\^). 

Marshall and Spiegelhalter (2003) show that for 
location parameters 9i, the conflict p-value (5.5) is 
equal to the cross-validation p-value (5.4) based on 
statistics 9i with symmetric likelihoods and using 
uniform priors in the derivation of 9f^. 

A clear disadvantage of this approach (as well as 
with the cross-validation mixed p- values) is that we 
have as many p- values as groups, and multiplicity 
might be an issue. (O'Hagan's measures might suf- 
fer from it too.) Since we are dealing with p- values, 
adjustment is most likely done by classical methods 
[controlling either the family-wise error rate, as the 
Bonferroni method, or the false discovery rate and 
related methods, as the Benjamini and Hochberg 
(1995) method]. None of these methods is foolproof 
and the danger exists that they also result in a lack 
of power. 

O'HAGAN (2003) EXAMPLE (CONTINUED). We 

compute the conflict p- values for the O'Hagan data 
set. We again use both, O'Hagan priors and non- 
informative priors. The results are shown in Table 
7. Taken at face value, these p-values behave nicely 
and detect the outlying group. 



Table 7 

Conflict p-values for the O^Hagan data set using 
noninformative priors and O'Hagan priors 

Group 1 Group 2 Group 3 Group 4 Group 5 

O'Hagan priors 0.84 0.74 0.73 0.88 0.00 
Noninformative 0.66 0.59 0.61 0.68 0.00 



6. A BINOMIAL-BETA EXAMPLE: BRISTOL 
ROYAL INFIRMARY INQUIRY DATA 

We finish the paper with a real example and a 
different hierarchical model. Specifically, we exem- 
plify the different checking procedures in a hierar- 
chical Binomial-Beta model on a data set analyzed 
at length in Spiegelhalter et al. (2002). Data consist 
in the number rii of open-heart operations and the 
corresponding number Yi of deaths for children un- 
der one year of age carried out in 12 hospitals in 
England. Data are shown in Figure 4. 

We consider the following model: 

Yi\9i '^Bm{9i,ni), i = l,...,I, 
I 

ir{e I a, 13) = Y[Bet&{9i \a,f3), 

i=l 

(6.1) 7r(a,/5)oc[(^i(a)-V'i(Q + /3)) 
•(^i(/?)-^i(a + /?)) 

where 7r(a,/3) is the Jeffreys prior (Yang and Berger 
(1997)), and Tpii^) = J2Zi{x + i)''^ denotes the tri- 
gamma function. We use both the maximum and 
the minimum of the frequencies of deaths, yi/rii, 
as checking statistics. Also, when simulating from 
the partial distributions we have used the normal 
approximation to the binomial, yi/rii ~ N(9i,9i{l — 
9i)/ni), so that the conditional distribution of the 
maximum and the minimum has an easy closed form 
expression. 

We compute the overall partial and posterior pre- 
dictive p-values, and also the individual (one for 



Table 8 

p-values for the mortality in pediatric cardiac surgery 





fprior 


Ppost 


Ppost 


Pppp 


Maximum 


0.03 


0.16 


0.23 


0.00 


Minimum 


0.67 


0.56 


0.62 


0.64 
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each hospital) O'Hagan's conflict measures and Mar- 
shaU and Spiegelhalter's conflict p-values. AU re- 
quire MCMC. We use 30,000 simulations after a 
warm-up period of 10,000. Algorithms in R are avail- 
able in http://bayes.escet.urjc.es/~mecastellanos/ 
Funct ionsBristol . zip . 

The overall p- values (EB prior, EE posterior, pos- 
terior and partial posterior) appear in Table 8. Also, 
in Figures 5 and 6 we show the corresponding pre- 
dictive distributions for, respectively, the maximum 
and the minimum. Both the flgures and the table 
show that the observed minimum is well supported 
by the assumed models with any of the p-values 
used. However, with the maximum, the EB prior and 
partial posterior show incompatibility (with the ppp 
showing more incompatibility than the EB prior), 
while the EB posterior and posterior p-values fail to 
do so. 

The multiple conflict measures are in Table 9, and 
the multiple conflict p-values in Table 10. In these 
tables, "1" refers to the hospital with the lowest 
mortality rate, and "10" to the one with the largest. 
According to O'Hagan's prescriptions, no hospitals 
show clear indication of incompatibility; all but Bris- 
tol are compatible. On the other hand, the multiple 



Table 9 

Posterior medians of Ci, i — 1,. . . , 12, for Bristol data set 



123456789 10 11 12 

0.51 0.09 0.07 0.06 0.06 0.05 0.05 0.05 0.10 0.19 0.64 3.11 
Hospitals are ordered from lowest to largest mortality rate. 

conflict p-values isolates Bristol as the only one in- 
compatible. No correction for multiplicity has been 
used. 

7. CONCLUSIONS 

In this paper we have investigated the checking 
of hierarchical models from an objective Bayesian 
point of view (i.e., introducing only the information 
in the data and model). We have explored several 

Table 10 
Conflict p-values for each hospital 



123456789 10 11 12 

0.89 0.72 0.70 0.71 0.70 0.66 0.46 0.47 0.42 0.35 0.17 0.00 
Hospitals are ordered from lowest to largest mortality rate. 



Deaths by operations in 12 hiospitais in England 



S- 

o 

■2 _ 




o 

S3 - oo 

o o 



200 SOO 400 SCO 

Fig. 4. Number of open-heart operations and deaths for children under one year of age carried out in 12 hospitals in England 
between 1991 and 1995. 
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ways of eliminating the unknown parameters to de- 
rive "reference" distributions. We have also explored 
different ways of characterizing "incompatibility." 
We propose use of the partial posterior predictive 
measures (MSppp), which we compare with many 
other proposals. Some of our findings are: 

• MSppp behave considerably better than the al- 
ternative MS^r^o^, MS^f.t and MS post- The be- 
havior of MSpost can be particularly bad with ca- 
sually chosen T's, failing to reject clearly wrong 
models (but notice that the specific T we use is the 
one proposed in Gelman, Carlin, Stern and Rubin 
(2003), Section 6.8). As a matter of fact, the mea- 
sures MSpost are very similar to the clearly inap- 
propriate MSpogi- 

• In our (limited) simulation study, the null sam- 
pling distribution of Pppp is found to be approxi- 
mately uniform, while those of p^^^^ and Ppost are 
far from uniformity. Also, Pppp is the most power- 
ful for the considered alternatives. 

• The simulation-based model checking seems to 
work well in detecting the incompatibility between 
the model and the data, but it requires proper pri- 
ors. 

• The O'Hagan method is highly sensitive to the 
prior chosen, and in fact it seems to be conserva- 
tive with noninformative priors. 

• The conflict p-values Pi^con seem to work well, 
but they produce as many p-values as number of 
groups and multiplicity might be an issue. Also, 
the resulting p- values will typically be highly de- 
pendent (any two p-values are based in the same 
data except for two observations). 

Partial posterior p- values are not as easy to compute 
as posterior p-values, but they are still relatively 
easy, and indeed nothing more sophisticated than 
R was needed for the computations in this paper. 
This, along with their good properties (as demon- 
strated along the paper), makes them the clearly 
recommended procedure for objective model check- 
ing when the testing statistic T is not (nearly) an- 
cillary. But if computation is perceived as an over- 
whelming reason in favor of posterior p-values, we 
recommend instead use of the EB-prior p- values: 
they have better properties and are easier to com- 
pute. 

APPENDIX A: MC COMPUTATIONS FOR 
SECTION 3.1 

To simulate from the empirical Bayes prior pre- 
dictive distribution mp^Q^(x) simply proceed as fol- 



lows: For / = 1, . . . , M simulate 

= (0i(;) , . . . , 0,(;)) ~ 7r^^(0) = n I A, r''), 

i=l 

and for each 9(^i^,l = 1, . . . , M, simulate 

X(0 = (^i.(i),...,X7.(;)) 
/ 

~/(xi0(o)=n/(^-i^^(o)- 

Simulations for the empirical Bayes posterior pre- 
dictive rripggf (x) proceed along the same lines except 
that 6 is now simulated from 



where 



1=1 



and 



Vi 



ni/a2 + l/f2- 



APPENDIX B: FULL CONDITIONAL FOR 
THE GIBBS SAMPLER IN SECTION 3.2 

To simulate from the joint posterior (3.5) we use 
an easy Gibbs sampler defined by the full condition- 
als 



;u|0,r2,Xo5,-Af(^^,y^; 



(B.l) 



(B.2) 



(B.3) 



with E,, 



J — and = y 



where 



/-I 



N{Ei,Vi),wheve 



All the full conditionals are standard distributions, 
trivial to simulate from. x~'^{^tO,) refers to a scaled 
inverse chi-square distribution: it is the distribution 
of {ija)/Y where Y ~ X^i'^)- 
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Fig. 5. Predictive distribution for T = max{j/i/ni} in the Bristol Royal Infirmary data. 



APPENDIX C: DETAILS FOR MCMC 
COIVIPUTATIONS IN SECTION 3.3 

The full conditionals for the Gibbs sampler are 

7r(^ I 0,r2,Xof, 



jbs J 



(C.l) 



r \9,fi,:x.obs\tobs 



f{tobs I 9) 
7r(T2 I e,^,y.obs) 



f{tobs I ^) 



(C.2) 

(C.3) 9\ IJ,,T^,Xobs\tobsOC 



(xtt{t^ I O,^,y.obs), 

■K{e I /i,r2,Xo6s) 



f{tobs I 9) 

The full conditionals (C.l) and (C.2) are identi- 
cal to (B.l) and (B.2), respectively, and hence they 
are easy to simulate from. Equation (C.3) is not of 
closed form, and we use Metropolis-Hastings within 
Gibbs for the full conditional of each Oi: 

T^ppp (^i I f^j 9 —ij ^obs \ tabs ) 
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Fig. 6. Predictive distribution for T = min{yi / rii} in the Bristol Royal Infirmary data. 



OAS 



(C.4) 



oc 



oc 



f{tobs I ^) 

N{9i\E,,Vi) 



f{tobs I 

where Ei^, Vi are given in (B.3). Next we need to find 
a good proposal to simulate from (C.4). An obvi- 
ous proposal would simply be the posterior TTposti&i \ 
//,r^,Xofts), but this can be a very bad proposal 
when the data are indeed "surprising" for the en- 
tertained model. In particular, the posterior distri- 



bution centers around the MLE 6 while the par- 
tial posterior centers around the conditional MLE, 
^cMLE, that is, 

OcMLE = argmax/(xo6s I tobs,0) 
fi^obs I 0) 



argmax 



fitobs I ^) 



It is intuitively obvious that, when the data are 
not "surprising," that is, when tobs comes from the 
"null" model, then f{xobs \ tobs,d) would be similar 
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to /(xo5s I 6) and the partial and posterior distribu- ^ T^post{(^i \t i ^ii • • • , ^j-i; ^i+i, • • • , &i,^obs) 



tions would also be similar. However, when the data f{tobs I ^i) • • • > • • • > 



oc exp<^ -7: ^ + 



oc exp < — - 




are "surprising" and tots is not a "typical" value, 
then the "null" model and the conditional model can 

i_yv K^^jj 1 1 2 ' 2/^ * /2_|_-|/2 

be considerably different, as well as the correspond- ^ v'^i t / \ ''^il^i + J-/r 

ing MLEs. For Metropolis proposals, ri(Z]-^)^/ Z)^ ^' 

Bayarri and Berger (2000) then suggest generating ■exp'^ -—^ — ''—nitobs — - 

from the posterior distribution but then "moving" ^ ^3 ^ j ^ 3 

the generated values closer to the mode of the target ( 1 / 2 / 1 

distribution (the partial posterior) by adding oc6^P|~2 (^^* ~^ 

(^cMLE,i — dMLE,i, ^nf^'^i- 1 

multiplied (when this results in improved mixing) by 
a Uniform(0,l) random generation. This and other 
algorithms for computing conditional distributions . exp 

are presented in Bayarri, Castellanos and Morales 
(2006). 

To avoid computation of Ocmle, which can be 
rather time consuming, we use instead an estimate 
Oc which we expect to be close enough (for our / 
purposes) to Ocmle for this model and this T (see — 29i —^Xi. H — j 

Bayarri and Morales (2003)). In particular, we take 
all components to be equal and given by 

, EUX,, ^3^ 
J _ 1 ' / 

• I y^.njtobs — y^^niOi 

where {Xf^i.-j, . . . ,X(^j.^) denote the group means \ j 

sorted in ascendent order. That is, we simply remove 
the largest sample mean and then average (we could which, after some algebra, reduces to 
have also used a weighted average if the sample sizes ^fa \ ^2 n n a a ^ \ + \ 

^ T^{t>i\T ,t>l,...,t)i^i,Ui+i,...,t)l,Xobs\tobs) 

were very dinerent). 

Then, the resulting algorithm to simulate from occxpf ^ {9.i E^)^\ 

(C.4) at stage k, given the (simulated) values (0^^, 0^ I '^K i ' 

k 2(fc)N • 

^ with and given in (4.6) and (4.7), respec- 

1. Simulate 6* r^N{0i \ Ei,Vi). tively. The result then follows if V^P can be shown 

2. Move the simulation 0* to to be greater than 0, which is true because 1 — 

0* = 0* + U-i0c-9MLE,i), E^T^^ 
where U is random number in (0,1). 

3. Accept candidate 0* with probability 



rii 
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APPENDIX D: DERIVATION OF THE FULL 
CONDITIONAL OF O'S IN SECTION 4.3 

The full conditional partial posterior density for 
0i is 

■K{0i I r^, 01, ... , 0j+i, . . . ,0i,Xobs \ tabs) 
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